I.  Introduction


Reliable automatic identification of aircraft from radar returns is of great interest
in many areas of modern society. During the last decade significant progress has been
made, both theoretically and experimentally, in developing radar systems capable of
this task. For example, it has been shown that the exact shape of an object can be
reconstructed if radar returns from it are available over an unlimited range of
frequencies and observation angles, conditions which cannot be met in practice. It has
also been shown [1] that the frequency range (Rayleigh range) corresponding to
wavelengths from half the size of the object to 10 times its maximum dimension carries
the essential information regarding its overall dimensions and approximate shape. In
practical situations it is desirable to measure the target at a limited number of
frequencies to reduce the cost and complexity of the radar system. Studies have
shown [2] that some frequencies are more effective than others for the classification
of aircraft.

In this paper, a discriminant analysis criterion is applied to frequency selection on
a large-scale data base of radar return measurements from models of five commercial
aircraft. Our goal is to characterize the optimum sets of frequencies which minimize
the classification error.

Scaled data are available for each plane at 0° elevation angle, from 0° to 180°
azimuth angle in 10° steps, over a frequency range from 8 MHz to 58 MHz in 1 MHz
steps, using HHP (horizontally transmitted, horizontally received polarization) and
coherent detection. Therefore, each of the 5 classes is represented by 19 prototypes,
each of which is a vector of 51 complex entries (amplitude and phase). The whole
measurement set has been arranged in a data base in which the measurements are
scaled to square meters. All system-related parameters have been removed from
the data. The prototypes are considered to give exact knowledge of the classes at
the corresponding angles. Once the optimum set of M frequencies has been selected
using the discriminant criterion, the corresponding frequencies are extracted from
the data base to produce a new, smaller data base which represents the prototypes
in the final feature space. The dimension of the feature space has thus been reduced
from N (51) to M < N.

To estimate the performance of the pattern recognition system, the measured
radar return of an unknown target is simulated. The simulation process needs to
be simple enough to be implementable on a computer, but still complex enough to
describe the most important phenomena encountered in the environment of the
radar system. For radar target identification systems, noise can be classified into
three broad classes [3]: noise generated by components within the radar system;
noise resulting from additive (linear) sources outside the system, such as clutter and
atmospheric and extraterrestrial sources; and noise characterized by multiplicative
disturbances, which can occur within or outside the system. Assuming an ideal system,
the noise is modelled as additive white Gaussian noise on the sampled output of
the output device within the radar system. This model is considered to be complex
enough to represent the physical phenomena and yet parametrically simple enough
to be of use in simulation and analysis. It is assumed that the noise processes at
each frequency are independent, identically distributed Gaussian random processes.
A noisy radar return of prototype j from class i would then result in the observation
$x_m = x_{ij} + n$, where $x_{ij}$ is the noise-free prototype vector at the M selected
frequencies and $n$ is a vector of M complex, independent, identically distributed
Gaussian random processes. Each component $n_l$ of $n$ is generated by forming the
complex sum $n_l = A_l + jB_l$, where $A_l$ and $B_l$ are independent Gaussian random
processes with zero mean and variance $\sigma^2/2$. The noise model is
thus completed by the specification of the variance $\sigma^2$ of the individual complex
Gaussian deviates $n_l$. The average noise power is specified in absolute terms in
units of dBm² [3].
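The noise model above can be sketched in a few lines of NumPy. This is only an illustration: the function names, the placeholder prototype and the 20 dBm² level are assumptions, as is the reading of dBm² as decibels relative to one square meter.

```python
import numpy as np

def complex_awgn(shape, sigma2, rng):
    """Draw complex Gaussian noise n = A + jB with total variance sigma2.

    A and B are independent, zero mean, each with variance sigma2 / 2.
    """
    std = np.sqrt(sigma2 / 2.0)
    a = rng.normal(0.0, std, size=shape)
    b = rng.normal(0.0, std, size=shape)
    return a + 1j * b

def db_to_linear(power_db):
    """Convert a power level in dB to linear units (assumed reference: 1 m^2)."""
    return 10.0 ** (power_db / 10.0)

rng = np.random.default_rng(0)
sigma2 = db_to_linear(20.0)             # illustrative noise level, not from the paper
x_ij = np.ones(4, dtype=complex)        # placeholder noise-free prototype, M = 4
x_m = x_ij + complex_awgn(x_ij.shape, sigma2, rng)   # noisy observation
```

Drawing the real and imaginary parts separately, each with variance $\sigma^2/2$, keeps the total variance of every complex deviate equal to $\sigma^2$.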


Classification of an unknown target is done by means of the nearest neighbor
(NN) decision rule. If the distance between the unknown target and one of the
prototypes is shorter than that to all the other prototypes, the unknown target is
given the class identity of that closest prototype. Thus, a classification
error is declared only if the wrong class is chosen.
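A minimal sketch of the NN rule for complex feature vectors follows. The function name and the toy catalogue are illustrative assumptions; the distance is the squared Euclidean distance between complex vectors.

```python
import numpy as np

def nearest_neighbor_class(x, prototypes, labels):
    """Return the class label of the prototype closest to x.

    prototypes: complex array of shape (n_prototypes, M)
    labels:     integer array of shape (n_prototypes,)
    """
    diff = prototypes - x
    dist2 = np.sum(np.abs(diff) ** 2, axis=1)   # squared distance to each prototype
    return labels[np.argmin(dist2)]

# Toy catalogue: two classes with one prototype each (illustrative values only).
toy_prototypes = np.array([[0.0, 0.0], [10.0, 10.0j]], dtype=complex)
toy_labels = np.array([0, 1])
decision = nearest_neighbor_class(np.array([9.0, 9.0j]), toy_prototypes, toy_labels)
```

An error is counted only when `decision` differs from the true class, so confusing two prototypes of the same class costs nothing.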

The system performance of the selected optimum sets of frequencies is estimated
at a given noise level by superimposing noise on each component at all selected
frequencies. This is done for all prototypes from all the classes. Each of these
corrupted prototypes is then treated as an unknown target and classified according
to the NN decision rule.
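The performance estimate described above amounts to a Monte Carlo loop. The sketch below assumes the prototypes are stored as an array of shape (number of prototypes, M) with one class label per row; the function name and the number of trials are assumptions.

```python
import numpy as np

def estimate_error_rate(prototypes, labels, sigma2, n_trials, rng):
    """Fraction of noisy prototypes assigned to the wrong class by the NN rule."""
    n_proto, m = prototypes.shape
    std = np.sqrt(sigma2 / 2.0)
    errors = 0
    for _ in range(n_trials):
        # Corrupt every prototype with complex white Gaussian noise.
        noise = rng.normal(0.0, std, (n_proto, m)) + 1j * rng.normal(0.0, std, (n_proto, m))
        noisy = prototypes + noise
        # Squared distance from every noisy sample to every clean prototype.
        d2 = np.sum(np.abs(noisy[:, None, :] - prototypes[None, :, :]) ** 2, axis=2)
        decided = labels[np.argmin(d2, axis=1)]
        # Only a wrong class counts as an error, not a wrong prototype.
        errors += np.count_nonzero(decided != labels)
    return errors / (n_trials * n_proto)
```

Sweeping `sigma2` over a range of noise levels and plotting the returned rate gives a classification performance curve of the kind discussed later.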

In  the  next  section  we  will  discuss  the  discriminant  criterion  used  for  the  feature 
selection,  and  then  some  of  the  resulting  optimum  sets  of  frequencies  are  presented 
along  with  their  classification  performance  curves. 

II.  Discriminant  Analysis 

In general, for multiple classes, each consisting of multiple prototypes for which
multidimensional measurements are available, it is very hard to relate
a set of features to the probability of classification error. The underlying probability
distributions of the classes must be known to find such a relationship, something
which is unlikely to be available in the real world. As it is extremely difficult to find
the set of features which minimizes the classification error, so-called discriminant
functions are widely used to obtain a more reliable set of features than would be
obtained by picking them at random. In the current literature many different criteria
have been proposed for discriminating between classes [4,5,6,7,8,9,10,11,12,13].
Criteria of this type do not utilize any information about the measurement noise.
Instead they use a distance measure to relate similarities within classes and resolve
dissimilarities between classes, thus selecting a set of features which maximizes, in
some sense, the average distance between the classes. Justification for the use of such
criteria is based on the assumption that the more distant the classes are on average,
the smaller the average classification error. This assumption does not, however,
guarantee that the selected set of features gives a classification error close to the
minimum obtainable with the same number of features. Also, the optimum set may
depend on the noise level. Papers have appeared in which the optimum number of
features is discussed [14,15,16,17,18,19], but there is no general theory of how many
features are needed to achieve a given minimum average probability of
error at a given noise level. Hence the number of features selected is chosen ad hoc,
mostly restricted by the computational effort needed.

In this section we describe a few discriminant criteria. These criterion functions
select a subspace in which the classes separate optimally in the sense of the
corresponding criterion. Denote the distance $d(x_{i,k}, x_{j,l})$ between the k-th
prototype $x_{i,k}$ from class i and the l-th prototype $x_{j,l}$ from class j as

$$d(x_{i,k}, x_{j,l}) = (x_{i,k} - x_{j,l})^{*} (x_{i,k} - x_{j,l}) , \tag{1}$$

where the asterisk denotes conjugate transpose.

A simple criterion function based on this distance measure for the $n_i$ prototypes
in class i, with $P_i$ denoting the a priori probability of class i for $1 \le i \le L$,
is given in [5] as

$$J_1(\mathcal{X}) = \frac{1}{2} \sum_{i=1}^{L} P_i \sum_{j=1}^{L} P_j \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} d(x_{i,k}, x_{j,l}) , \tag{2}$$

which is the average probable distance between all prototypes of all L classes in the
set $\mathcal{X}$ of prototypes.


The goal of the feature selection process is to identify a subset $\mathcal{Z}$ of feature
vectors in the set $\mathcal{X}$ of possible measurement vectors so as to maximize the chosen
criterion function. That is, the optimum set $\hat{\mathcal{Z}}$ of features is that which
maximizes $J_1(\mathcal{Z})$, i.e.

$$\hat{\mathcal{Z}} = \arg\max_{\mathcal{Z} \subset \mathcal{X}} J_1(\mathcal{Z}) . \tag{3}$$

In order to pursue the implications of the criterion function given above, notice
that (2) can be written as

$$J_1(\mathcal{X}) = \sum_{i=1}^{L} P_i (m_i - m)^{*} (m_i - m) + \sum_{i=1}^{L} P_i \frac{1}{n_i} \sum_{k=1}^{n_i} (x_{i,k} - m_i)^{*} (x_{i,k} - m_i) , \tag{4a}$$

where $m_i$ denotes the vector average of the $n_i$ prototypes in class i and $m$ denotes
the vector average of all prototypes in the set $\mathcal{X}$.

The first term appearing in (4a) represents the distance of the individual class
means $m_i$ to the sample mean $m$ and is referred to as the between-class or inter-class
distance. The inter-class distance provides a measure of the separation between the
"centers" of the different prototype classes. Clearly, it is desirable to choose features
so that the inter-class distance is as large as possible.

The term appearing as the second summation in (4a) represents the average of
the distances of the prototypes in class i to the center or class mean $m_i$ of class
i. This distance, referred to as the average within-class or intra-class distance,
characterizes the degree of clustering of each of the L classes. In contrast to the
inter-class distance, it is generally desirable to choose features so as to minimize
the average intra-class distance.

In order to treat these terms separately, define the inter-class scatter (or
covariance) matrix for the set $\mathcal{X}$ of prototypes, which is a measure of the
separation between classes, as

$$S_s = \sum_{i=1}^{L} P_i (m_i - m)(m_i - m)^{*} , \tag{5}$$

and the average intra-class scatter matrix, which is a measure of clustering within
classes, as

$$S_c = \sum_{i=1}^{L} P_i \frac{1}{n_i} \sum_{k=1}^{n_i} (x_{i,k} - m_i)(x_{i,k} - m_i)^{*} . \tag{6}$$

Then it is easy to see that the criterion function $J_1$ in (2) can be written as

$$J_1(\mathcal{X}) = \operatorname{tr}(S_s + S_c) . \tag{7}$$

Unfortunately, this criterion function does not significantly enhance the
classifiability of the prototypes in the resulting optimized feature set. In particular,
notice that if either the inter-class or the intra-class distance is large, then
$J_1(\mathcal{X})$ is also large; the criterion function given by (2) does not produce the
desired result of minimizing the intra-class distance while maximizing the inter-class
distance. Hence, the criterion function $J_1(\mathcal{X})$ gives little indication about the
separability of the target classes.
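The equivalence between the pairwise-distance form of $J_1$ and the trace of the summed scatter matrices can be checked numerically. The sketch below is a sanity check on synthetic data, not the authors' code; the function names are hypothetical, and the overall mean is taken as the prior-weighted mean of the class means, which coincides with the average of all prototypes when each prior is proportional to the class size.

```python
import numpy as np

def scatter_matrices(classes, priors):
    """Return (inter-class, intra-class) scatter for a list of (n_i, M) complex arrays."""
    means = [c.mean(axis=0) for c in classes]
    m = sum(p * mi for p, mi in zip(priors, means))   # prior-weighted overall mean (assumption)
    dim = classes[0].shape[1]
    s_s = np.zeros((dim, dim), dtype=complex)
    s_c = np.zeros((dim, dim), dtype=complex)
    for p, c, mi in zip(priors, classes, means):
        d = (mi - m)[:, None]
        s_s += p * (d @ d.conj().T)                   # outer product of mean offsets
        dev = c - mi                                  # rows are x_{i,k} - m_i
        s_c += p * (dev.T @ dev.conj()) / len(c)      # averaged outer products
    return s_s, s_c

def j1_pairwise(classes, priors):
    """Evaluate J_1 directly as the prior-weighted average pairwise squared distance."""
    total = 0.0
    for p_i, c_i in zip(priors, classes):
        for p_j, c_j in zip(priors, classes):
            d2 = np.abs(c_i[:, None, :] - c_j[None, :, :]) ** 2
            total += p_i * p_j * d2.sum() / (len(c_i) * len(c_j))
    return 0.5 * total
```

For any set of prototypes, `j1_pairwise` and the real part of `np.trace(s_s + s_c)` should agree to rounding error, which makes plain that a large value of $J_1$ can come from either term.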

A feature selection criterion function of the discriminant type that does not
suffer from the shortcomings discussed above is given in [5] by

$$J_2(\mathcal{X}) = \frac{\operatorname{tr}(S_s)}{\operatorname{tr}(S_c)} , \tag{8}$$


which gives the ratio of the average inter-class distance to the average intra-class
distance. This criterion function is intuitively of more utility for the radar target
classification problem, since it increases as the inter-class distance increases
relative to the intra-class distance, or as the intra-class distance decreases
relative to the inter-class distance. Thus, maximizing this function produces a set of
features such that the resulting measurement prototypes separate well into their
respective classes. The principal shortcoming of the criterion function $J_2(\mathcal{X})$
is that it is possible for correlations among the components of the prototype
measurements to skew the results [5]. This situation can be rectified by preprocessing
the prototype measurements.

The criterion function used for feature selection in this paper is the
one due to Wilks [20,21], which is based on the ratio of the "total volume" of the
feature space to the clustering distance, $S_c$. It is given as

$$J_4(\mathcal{X}) = \frac{|S_t|}{|S_c|} = \prod_{k=1}^{M} \lambda_k , \tag{9}$$

where $S_t = S_c + S_s$ is the sum of the average intra-class scatter matrix and the
average inter-class scatter matrix. The covariance matrix $S_t$ is often referred to as
the "total scatter" matrix.

In order to eliminate the possibility of skewed results due to correlations among
components of the measurement vectors, the measurement prototypes are
preprocessed so that the resulting intra-class covariances are unity and the total
scatter covariance matrix is diagonal. That is, the measurement prototypes are
transformed by a mapping $C : \mathcal{X} \to \mathcal{Y}$ so that the components of the resulting
feature vectors are uncorrelated. Thus, we form the mapping from prototypes $x_{i,k}$
in the measurement space $\mathcal{X}$ to prototypes $y_{i,k}$ in the feature space $\mathcal{Y}$ as

$$y_{i,k} = C x_{i,k} , \tag{10}$$

where  the  matrix  C  is  characterized  by  the  constraints  developed  below. 

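If we denote by $S_s(y)$ the inter-class scatter matrix for the prototypes $y_{i,k}$ in
the feature space $\mathcal{Y}$, then we see that

$$\begin{aligned}
S_s(y) &= \sum_{i=1}^{L} P_i \, (m_y(i) - m_y)(m_y(i) - m_y)^{*} \\
       &= \sum_{i=1}^{L} P_i \, (C m_i - C m)(C m_i - C m)^{*} \\
       &= C \left[ \sum_{i=1}^{L} P_i \, (m_i - m)(m_i - m)^{*} \right] C^{*} ,
\end{aligned}$$

or

$$S_s(y) = C S_s(x) C^{*} . \tag{11}$$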


Similarly, the average intra-class scatter matrix $S_c(y)$ for the feature vectors
$y_{i,k}$ is given by

$$S_c(y) = C S_c(x) C^{*} . \tag{12}$$


Combining  these  two  results  yields 


$$S_t(y) = S_s(y) + S_c(y) = C \left( S_s(x) + S_c(x) \right) C^{*} . \tag{13}$$


Thus, we see that the requisite transformation $C$ from the measurement space $\mathcal{X}$
to the feature space $\mathcal{Y}$, as dictated by the criterion function $J_4(\mathcal{X})$ in (9),
may be found by solving the resulting set of equations for the constraints [4]:

$$C S_c(x) C^{*} = I , \tag{14}$$

$$C S_t(x) C^{*} = \Lambda , \tag{15}$$

where $I$ is the M x M identity and $\Lambda$ is an M x M diagonal matrix.

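In order to characterize this transformation, following the procedure given in
[4], we define a matrix $S_c^{1/2}(x)$ such that

$$S_c^{1/2}(x) \left( S_c^{1/2}(x) \right)^{*} = S_c(x) , \tag{16}$$

$$S_c^{-1/2}(x) = \left( S_c^{1/2}(x) \right)^{-1} . \tag{17}$$

Then, we have that

$$C S_t(x) C^{*} = \left( C S_c^{1/2}(x) \right) \left( S_c^{-1/2}(x) \, S_t(x) \, (S_c^{-1/2}(x))^{*} \right) \left( C S_c^{1/2}(x) \right)^{*} . \tag{18}$$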

Next, we define a matrix $F$ such that

$$F = C S_c^{1/2}(x) , \tag{19}$$

then we have that

$$C S_c(x) C^{*} = C S_c^{1/2}(x) \left( C S_c^{1/2}(x) \right)^{*} = F F^{*} , \tag{20}$$

or, from (14),

$$F F^{*} = I . \tag{21}$$

Combining this last expression with (18) gives

$$C S_t(x) C^{*} = F V F^{*} , \tag{22}$$

where $V$ is a matrix defined by the relation

$$V = S_c^{-1/2}(x) \, S_t(x) \, (S_c^{-1/2}(x))^{*} . \tag{23}$$

Finally, notice that the matrix $S_c(x)$ can be written as

$$S_c(x) = E D E^{*} , \tag{24}$$

with

$$E E^{*} = I , \tag{25}$$

where $E$ is a matrix whose columns are the orthonormal eigenvectors of $S_c(x)$, and
$D$ is a diagonal matrix of the eigenvalues of $S_c(x)$. Thus, we have that

$$S_c(x) = E D E^{*} = E D^{1/2} D^{1/2} E^{*} = E D^{1/2} E^{*} \left( E D^{1/2} E^{*} \right)^{*}$$

and

$$S_c^{-1}(x) = S_c^{-1/2}(x) \left( S_c^{-1/2}(x) \right)^{*} .$$

This gives

$$S_c^{1/2}(x) = E D^{1/2} E^{*}$$

and

$$S_c^{-1/2}(x) = E D^{-1/2} E^{*} .$$


As a result, we see that the transformation matrix $C$ is found as follows:

1.  Find the decomposition

    $$S_c(x) = E D E^{*} , \quad \text{where } E E^{*} = I .$$

2.  Find $S_c^{-1/2}(x)$ using the relation

    $$S_c^{-1/2}(x) = E D^{-1/2} E^{*} .$$

3.  Calculate the matrix $V$ as

    $$V = S_c^{-1/2}(x) \, S_t(x) \, (S_c^{-1/2}(x))^{*} .$$

4.  Find the matrix $F$ that satisfies

    $$F V F^{*} = \Lambda \quad \text{or} \quad V = F^{*} \Lambda F ,$$

    where $F F^{*} = I$.

5.  Calculate the transformation matrix $C$ as

    $$C = F S_c^{-1/2}(x) . \tag{33}$$
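The five steps can be sketched with NumPy's Hermitian eigensolver. This is a minimal illustration under the assumption that the intra-class scatter matrix is Hermitian and positive definite (so that its inverse square root exists); the function and variable names are not from the paper.

```python
import numpy as np

def whitening_transform(s_c, s_t):
    """Return (C, lam) such that C S_c C^* = I and C S_t C^* = diag(lam).

    s_c: intra-class scatter, Hermitian positive definite (assumed)
    s_t: total scatter, Hermitian
    """
    d, e = np.linalg.eigh(s_c)                        # step 1: S_c = E D E^*
    s_c_inv_half = (e / np.sqrt(d)) @ e.conj().T      # step 2: E D^{-1/2} E^*
    v = s_c_inv_half @ s_t @ s_c_inv_half.conj().T    # step 3: V
    lam, u = np.linalg.eigh(v)                        # step 4: V = U diag(lam) U^*
    f = u.conj().T                                    # rows of F are eigenvectors of V
    c = f @ s_c_inv_half                              # step 5: C = F S_c^{-1/2}
    return c, lam
```

`np.linalg.eigh` returns the eigenvalues in ascending order, so the last rows of `c` correspond to the directions with the largest eigenvalues. Keeping only those rows gives a reduced transformation.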


Thus, we see from (33) that the transformation $C$ from the measurement space
$\mathcal{X}$ to the feature space $\mathcal{Y}$ consists of a weighting transformation,
$S_c^{-1/2}(x)$, and an orthonormal transformation, $F$.

The weighting transformation $S_c^{-1/2}(x)$ is obtained as the "inverse square-root"
of the average intra-class scatter matrix. This portion of the transformation matrix
$C$ has the effect of minimizing the average intra-class distance.

The second component matrix of the transformation given by (33) is, by
definition, a unitary transformation and, as such, merely performs a rotation operation.
In particular, the matrix $F$ is chosen so that the average intra-class scatter matrix for
the feature vectors is the identity matrix and the sum of the average intra-class and
inter-class scatter matrices is a diagonal matrix after the transformation.

In summary, we have seen that the transformation matrix $C$ maps the measurement
space $\mathcal{X}$ into a feature space $\mathcal{Y}$ where the prototype vectors $y_{i,k}$ have
intra-class covariances that are evenly distributed and inter-class covariances that are
zero between classes. Since the transformed prototypes $y_{i,k}$ from different classes are
uncorrelated, the separability of the resulting set of prototypes can be considered
on a component-wise basis. From (15) we see that the contributions of the feature
vector components to the optimization are additive. Hence, if it is desired to find
the optimal set of M features for the set of targets of interest, it is necessary only
to find those M component features that individually provide maximum separation
between the target classes in the sense of the criterion $J_4$ given in (9).

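Notice from (9) that each $\lambda_k$, $k = 1, \dots, M$, is an eigenvalue of the matrix
$S_c^{-1/2}(x) \, S_t(x) \, (S_c^{-1/2}(x))^{*}$. Also notice that

$$S_c^{-1/2}(x) \, S_t(x) \, (S_c^{-1/2}(x))^{*} = S_c^{-1/2}(x) \, S_s(x) \, (S_c^{-1/2}(x))^{*} + I . \tag{34}$$

Now by forming the transformation matrix $Q$ such that $Q Q^{*} = I$ and

$$Q \left[ S_c^{-1/2}(x) \, S_s(x) \, (S_c^{-1/2}(x))^{*} + I \right] Q^{*} = \tilde{\Lambda} + I , \tag{35}$$

where $\tilde{\Lambda}$ is an M x M diagonal matrix of the eigenvalues $\tilde{\lambda}_k$ of
$S_c^{-1/2}(x) \, S_s(x) \, (S_c^{-1/2}(x))^{*}$, we have that [5]

$$\lambda_k = \tilde{\lambda}_k + 1 . \tag{36}$$

Since, by definition, $\tilde{\lambda}_k \ge 0$, the minimum value of $\lambda_k$ is 1.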

Intuitively, we see that each eigenvalue $\lambda_k$ gives the ratio of the sum of the
average inter-class and intra-class distances to the average intra-class distance, in
the direction of the corresponding eigenvector. Thus, the constraint $C S_c C^{*} = I$
imposed on the mapping $C$ has the effect of "normalizing" the distance measure
in the feature space $\mathcal{Y}$ so that each of the eigenvalues gives an absolute indication
of how well the classes separate in the corresponding direction. If, for example, it
happens that $\lambda_k$ is large for a certain value of k, then we would expect the classes to
separate well when the k-th component feature is employed for classification. On
the other hand, if a feature component has an eigenvalue of 1, then the total distance
in that direction equals the average intra-class distance, so the inter-class
contribution in that direction is zero. This, in turn, implies that the target classes
are not separable in the direction of the corresponding eigenvector.
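A two-dimensional toy case makes the interpretation of the eigenvalues concrete. In the sketch below the scatter matrices are invented for illustration: the classes differ only along the first axis, so one eigenvalue exceeds 1 and the other equals 1. The eigenvalues are computed from $S_c^{-1} S_t$, which is similar to $S_c^{-1/2} S_t (S_c^{-1/2})^{*}$ and therefore has the same spectrum.

```python
import numpy as np

def separation_eigenvalues(s_c, s_s):
    """Eigenvalues of S_c^{-1} S_t, sorted in ascending order."""
    s_t = s_c + s_s
    lam = np.linalg.eigvals(np.linalg.solve(s_c, s_t))
    return np.sort(lam.real)

# Illustrative scatter matrices (not from the paper's data).
toy_s_c = np.diag([1.0, 4.0])      # intra-class scatter per axis
toy_s_s = np.diag([3.0, 0.0])      # classes are separated along axis 0 only
toy_lam = separation_eigenvalues(toy_s_c, toy_s_s)
```

Here the first axis yields $(1 + 3)/1 = 4$, while the second yields $(4 + 0)/4 = 1$ and carries no class information, despite having the larger raw scatter.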

The criterion function $J_4$ of (9) can be employed either as the basis for feature
selection, or as the discriminant function for the feature extraction process. If the
criterion function is used for feature selection, then $F$ becomes an M x M matrix,
with row vectors that are the eigenvectors of the M-dimensional subspace producing
the largest sum of M eigenvalues. Thus, in this case, the transformation matrix $C$
is M x M.

If, on the other hand, the function $J_4$ is used as a criterion for feature extraction,
then $F$ becomes an M x N matrix, with row vectors that are the eigenvectors of the
corresponding M largest eigenvalues of the N x N matrix
$S_c^{-1/2}(x) \, S_t(x) \, (S_c^{-1/2}(x))^{*}$.
In this case, $C$ becomes a matrix with dimension M x N.

Finally, we point out that while several other discriminant-based algorithms
for feature selection and extraction have been proposed [6,22,23,12], these criteria
place little emphasis on distributing the total inter-class distance equally among
the classes in the resulting coordinate system. In contrast, it is unlikely that a
feature selection or extraction process based on $J_4$ would result in good separability
between two classes in the target set at the expense of separation between other
classes in the set [5]. In addition, the criterion is moderately easy to implement
and possesses many of the properties desirable in a discriminant function. The
performance of $J_4$ should give a good indication of how powerful a tool discriminant
analysis is for feature selection for composite classes.

Most often, discriminant analysis is applied to simple classes, where noisy
training samples are available for each class. In this case, the scatter of the prototypes
is due to the measurement noise instead of the spread of many exact prototypes of
composite classes. Generally speaking, for simple classes, where only noisy
training samples are available, the discriminant algorithm selects a subspace where the
noise level is low compared to the average inter-class distance; for composite classes,
where exact information on the subclasses is available, the discriminant algorithm
selects a subspace where the average inter-class distance is large compared to the
average intra-class distance. For a simple class, the mean of all the noisy
prototypes (training samples) gives a moderately good representative mean vector for the
class. On the other hand, if the class is composite, the mean vector of the class
prototypes may not be a good representative for the class. This may be the most
severe shortcoming of applying discriminant analysis methods to composite classes.


III.  Results 


The discriminant criterion described above was applied to the data set assuming
either no prior information about the azimuth angle of the target, or assuming
±20° prior directional information. The optimum set of M frequencies is found
by forming every possible subset of M frequencies out of N (51), calculating the
criterion value for each, and selecting the subset which produces the highest criterion
value. As the desired number of frequency measurements increases (up to 26), the
number of possible subsets increases drastically. Also, as the number of features is
increased, the computation for each subset takes more time, and the search becomes
very expensive in CPU time for an ordinary computer system.
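The growth in the number of candidate subsets can be tabulated directly with the binomial coefficient. The sketch below only counts subsets; it does not evaluate the criterion, and the variable names are illustrative.

```python
from math import comb

N_FREQUENCIES = 51   # 8 MHz to 58 MHz in 1 MHz steps

# Number of distinct M-frequency subsets an exhaustive search must score.
subset_counts = {m: comb(N_FREQUENCIES, m) for m in (1, 2, 3, 4, 5, 8, 26)}

for m, count in subset_counts.items():
    print(f"M = {m:2d}: {count:,} subsets")
```

The count peaks near M = 26, roughly half of 51, which is why the text singles out that value.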

A.  No  Prior  Information  of  the  Azimuth  Angle 

For coherent detection the 10 best sets of 1, 2, 3, 4 and 5 frequencies were
obtained. The result of this search for the optimum set of 4 frequencies is shown in
Table 1. There is less than 1% difference between the criterion value of the optimum
set and the value of the 10th best set. This indicates low sensitivity to exactly
which set of frequencies is selected as the optimum one. If simulation is done using
the optimum set and the 10th best set, the classification result, as expected, is
very similar. If the number of desired frequencies exceeds 5, the search becomes
extremely time consuming. Comparing the 10 best sets of frequencies for 3, 4 and 5
desired frequencies shows a strong tendency for some of the same frequencies to be
selected every time. Hence, to make the search less exhaustive and still be able to
obtain a higher number of desired frequencies, 4 of the frequencies were chosen
beforehand and the feature selection algorithm was then used to search for the 4
additional frequencies which would maximize the criterion function, resulting in the
"optimum" 8 frequencies. The optimum sets of 1, 2, 3, 4, 5 and "8" frequencies are
shown in Table 2. There is, of course, no guarantee that this set of 8 frequencies is
the optimum set in the sense of the criterion function, but very likely it is close to
the optimum 8, especially as there is low sensitivity to which set of frequencies is
the optimum one. Figure 1 shows the classification result using the optimum set of 4
frequencies and a set of 4 equally spaced frequencies. Clearly, the optimum set yields
better performance than the set of equally spaced frequencies, though the improvement
is not large. Similar results were also obtained for corresponding sets of 3, 5, and 8
frequencies.
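The exhaustive search, and the cheaper variant in which some frequencies are fixed beforehand, can be sketched as follows. The scoring function is passed in as a parameter so that the sketch does not commit to a particular form of the criterion; the function name, the `fixed` argument and the toy weights are assumptions for illustration.

```python
from itertools import combinations
import heapq

def best_subsets(score, n_features, m, fixed=(), top=10):
    """Rank all size-m index subsets that contain every index in `fixed`.

    score: callable mapping a sorted tuple of indices to a criterion value
    Returns the `top` best (value, subset) pairs, best first.
    """
    fixed = tuple(sorted(fixed))
    free = [i for i in range(n_features) if i not in fixed]
    candidates = (
        tuple(sorted(fixed + extra))
        for extra in combinations(free, m - len(fixed))
    )
    return heapq.nlargest(top, ((score(s), s) for s in candidates))

# Toy criterion: sum of per-feature weights (illustrative only).
toy_weights = [5, 1, 4, 2, 3]
toy_score = lambda subset: sum(toy_weights[i] for i in subset)
toy_best = best_subsets(toy_score, 5, 2, top=1)
toy_seeded = best_subsets(toy_score, 5, 2, fixed=(1,), top=1)
```

With 4 of 8 frequencies fixed, the free part of the search covers the subsets of 4 drawn from the remaining 47 frequencies instead of all subsets of 8 drawn from 51, at the price of losing any guarantee of optimality.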

In Figure 2 the classification curves for the optimum sets of 3, 4, 5 and "8"
frequencies are compared to the curve obtained when using all 51 frequencies.
It is obvious that the classification result is much better if all 51 frequencies
are used. This is not very surprising, as all the eigenvalues found in the feature
selection algorithm were approximately in the range 1.0 to 1.8, which means that
the separation in any direction in the feature space is of the same magnitude and
the classes are close to each other and heavily overlapping. That might have been
expected, as the azimuth angle varies from 0° to 180°, causing a huge change in the
effective area of the planes. For 20% probability of misclassification, the 5 optimum
frequencies can operate at a noise level of around 22 dBm², while all 51 frequencies
can operate at a noise level of around 32 dBm². If the target were now measured at
the 5 best frequencies 10 times, altogether requiring 50 measurements, and the
average of these 10 sets of measurements were used for the classification algorithm
(so that the SNR in the measured signal has been increased by 10 dB), the performance
of this classification scheme is just slightly better than the performance of the
classifier using all 51 frequencies measured once. This implies that it is sufficient to
measure the target only at the optimum set of 5 frequencies 10 times and still have
about the same information about the target as if it were measured at all 51
frequencies once. The classification result is shown in Figure 3. Figure 4 shows how
much improvement is obtained as the number of measurements of the set of 5 optimum
frequencies is increased by one at a time. The largest improvement from adding a
measurement of the best set occurs when the best set is measured twice instead of
once. As the number of measurements is increased, each additional set of
measurements has less impact on the improvement of the classifier, i.e. the
improvement levels off. The same tendency can also be observed in Figure 2 for the
number of frequencies. There is about the same improvement going from 4 frequencies
to 5 frequencies as going from 5 frequencies to 8. In other words, the performance
advantage becomes less and less significant as the number of frequencies is increased
or the number of multiple measurements is increased.
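The 10 dB figure follows from averaging K independent noisy measurements, which divides the noise variance by K. The sketch below checks this empirically for K = 10; the sample size, the seed and the names are arbitrary assumptions.

```python
import numpy as np

def averaging_gain_db(k):
    """SNR improvement, in dB, from averaging k independent measurements."""
    return 10.0 * np.log10(k)

rng = np.random.default_rng(0)
sigma2, k, n_samples = 1.0, 10, 100_000
std = np.sqrt(sigma2 / 2.0)

# k independent complex noise draws per sample, averaged across the k repeats.
noise = rng.normal(0.0, std, (n_samples, k)) + 1j * rng.normal(0.0, std, (n_samples, k))
averaged = noise.mean(axis=1)

# Ratio of single-measurement noise power to averaged noise power, in dB.
empirical_gain_db = 10.0 * np.log10(sigma2 / np.mean(np.abs(averaged) ** 2))
```

The gain grows only logarithmically with K: going from one measurement to two gains about 3 dB, while going from nine to ten gains less than 0.5 dB, consistent with the levelling-off described above.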

B.  Partial  ±20°  Information  of  the  Azimuth  Angle

Modern radar systems can be used to give information about the direction of the
target. Hence it is realistic to consider a pattern recognition system where some prior
directional information about the target is available.

Assuming prior information of the azimuth angle of the target to be within ±20°
uncertainty, the best sets of 4 frequencies were found for each azimuth angle. This
is done by finding which frequencies discriminate the classes best at each azimuth
angle, ±20°. This results in 19 sets of optimum frequencies, one set for each
azimuth angle. Now if an unknown target were being measured, its approximate
direction (azimuth angle) would need to be estimated, and then the target would
be measured at the optimum set of frequencies selected for that azimuth angle and
classified by comparing it to the corresponding noise-free prototypes. The optimum
sets of 4 frequencies found at every azimuth angle are shown in Table 4. It is
interesting to see that comparatively lower frequencies are selected to discriminate
the airplanes at broadside than at azimuth angles closer to 0° or 180°. The
physical interpretation of this could be that at broadside, lower frequencies
carrying size information give the best discrimination, while higher frequencies
carrying information about finer details give the best discrimination at 0° and 180°.
It was observed that if the test set were loaded with prototypes at azimuth angle
i° + 20° and the catalogue set with prototypes at azimuth angles i°, i° ± 10° and
i° ± 20°, at the frequencies selected for azimuth angle i°, the classification
performance did not degrade much from what it was when the test set contained
prototypes at i° azimuth angle. This means that if a target is estimated to be at i°
but is in reality within i° ± 20°, the classification algorithm is not sensitive to this
inaccuracy in the azimuth angle. However, this does not state anything about the
performance of the system if the target is at an aspect angle which is not a multiple
of 10°.
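Selecting the catalogue prototypes for a given azimuth estimate can be sketched as a simple window over the 19 measured angles. The snapping of the estimate to the nearest measured angle and the truncation of the window at 0° and 180° are assumptions; the source does not say how the edge angles were handled.

```python
import numpy as np

AZIMUTHS = np.arange(0, 181, 10)    # the 19 measured azimuth angles, in degrees

def catalogue_angles(estimated_deg, half_width_deg=20):
    """Measured angles within +/- half_width_deg of the nearest measured angle.

    The window is truncated at 0 and 180 degrees (assumed edge behaviour).
    """
    centre = AZIMUTHS[np.argmin(np.abs(AZIMUTHS - estimated_deg))]
    mask = np.abs(AZIMUTHS - centre) <= half_width_deg
    return AZIMUTHS[mask]

broadside = catalogue_angles(90)     # window centred on broadside
nose_on = catalogue_angles(0)        # truncated window at the edge
```

Away from the edges the window holds 5 angles per class instead of 19, which shrinks the catalogue that the NN rule has to search.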

To examine what is gained by having ±20° prior information of the azimuth
angle, the test set was loaded with prototypes at angle i°, at the optimum 4 frequencies
found when no prior information existed, and the test set was classified against the whole
catalogue containing every azimuth angle. Then the test set was loaded with the
optimum 4 frequencies at angle i° given ±20° knowledge of the angle, and the
catalogue set was loaded with the same frequencies at angles i°, i° ± 10° and i° ± 20°.
Figures 5-7 compare the classification curves for the classifier with no
azimuth information and the one with ±20° prior information for 0°, 90°
and 180°. Except for 90°, the performance of the classifier with the prior information
is superior to the performance of the one with no information. The reason for this
bad result at the broadside might be that the difference between the
measurement values for the huge airplanes and the small ones is largest there, hence the selection
algorithm favours just the discrimination between the large and the small groups,
resulting in bad discrimination within each size group. At 90° the optimum set was
selected as the four lowest frequencies. But overall, there is significant improvement
in the classification performance given the prior information of the azimuth angle.
This can be due to two factors:
•  the optimum frequencies for each azimuth angle do give better discrimination
between the classes at the corresponding azimuth angle than the optimum set of
4 frequencies found when no prior information existed.
•  the improvement is due to the decreased number of prototypes in the catalogue
set for each class (5 prototypes instead of 19).

One would expect the improvement to be due to both of these factors. To find
how much is gained by each factor, prototypes at azimuth angle i° were loaded into
the test set, and prototypes at azimuth angles i°, i° ± 10° and i° ± 20° were loaded into
the catalogue set at the optimum 4 frequencies found assuming no prior information
about the azimuth angle. In Figure 8 the classification curves for targets at 30° are
compared for three cases: the optimum 4 frequencies when no prior information
exists, the optimum 4 frequencies with ±20° prior information, and the optimum
4 frequencies when no prior information exists but using only prototypes within
±20° of the real azimuth angle in the catalogue set. Results were also obtained
for every 30° azimuth angle. It was observed that, on average, the improved
performance when prior azimuth angle information exists is due equally to the prior
information of the azimuth angle and to the decrease in the number of prototypes in
each class from 19 to 5.
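
The three-way comparison described above could be reproduced along the following lines. The sketch assumes additive complex Gaussian noise, a Euclidean nearest-prototype rule, random stand-in data, and hypothetical frequency index sets f_global and f_local; none of these are taken from the measurements.

```python
import numpy as np

def error_rate(test, catalogue, sigma, trials, rng):
    """Monte Carlo misclassification rate of a nearest-prototype classifier.

    test      : complex array (classes, freqs), one noise-free prototype per class
    catalogue : complex array (classes, prototypes, freqs)
    sigma     : standard deviation of the assumed additive complex Gaussian noise
    """
    n_classes = test.shape[0]
    errors = 0
    for _ in range(trials):
        noise = sigma * (rng.standard_normal(test.shape)
                         + 1j * rng.standard_normal(test.shape))
        noisy = test + noise
        # d[t, c, p]: distance from noisy test vector t to prototype p of class c.
        d = np.linalg.norm(catalogue[None] - noisy[:, None, None, :], axis=-1)
        decided = d.min(axis=2).argmin(axis=1)
        errors += np.count_nonzero(decided != np.arange(n_classes))
    return errors / (trials * n_classes)

# Random stand-in data, NOT the measured radar returns.
rng = np.random.default_rng(1)
data = rng.standard_normal((5, 19, 51)) + 1j * rng.standard_normal((5, 19, 51))

i = 3                          # azimuth index of 30 degrees
window = slice(i - 2, i + 3)   # angles i, i +/- 10 and i +/- 20 degrees
f_global = [4, 17, 30, 45]     # hypothetical "no prior information" frequency set
f_local = [0, 9, 22, 38]       # hypothetical azimuth-specific frequency set

configs = {
    "global freqs, full catalogue": (data[:, i, f_global], data[:, :, f_global]),
    "local freqs, +/-20 deg catalogue": (data[:, i, f_local], data[:, window, f_local]),
    "global freqs, +/-20 deg catalogue": (data[:, i, f_global], data[:, window, f_global]),
}

rates = {name: error_rate(test, cat, sigma=1.0, trials=200, rng=rng)
         for name, (test, cat) in configs.items()}
```

Comparing the first and third entries isolates the effect of the smaller catalogue, while comparing the second and third isolates the effect of the azimuth-specific frequencies.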


IV.  Discussion

The  classification  improvement  using  the  optimum  set  of  frequencies  compared 
to  the  set  of  equally  spaced  ones  is  rather  low.  There  are  probably  many  reasons  for 
this.  The  criterion  function  used  to  select  the  frequencies  does  not  bear  any  direct 
relation to the probability of misclassification criterion. It only finds the frequencies
at which the classes separate best in a noiseless environment, hopefully implying
that the farther apart the classes are on average, the lower the probability of
classification error. When no prior information of the azimuth angle exists, the
criterion value is rather low. This indicates that the separation of the
classes is not large and that they probably overlap each other heavily. This should be expected:
as the azimuth angle is changed from 0° to 180°, the effective area of the targets
changes very much. Hence, a small airplane on the broadside may look similar to a
large airplane at some other angle. Also, there is little variation in the criterion
value for the 10 best sets of frequencies. This suggests that there does not exist a
set of frequencies which can discriminate the planes much better than any other set.
The 5 classes form 10 different pairs of classes, all of which need to be separated
from each other. The feature selection algorithm was used for each pair of classes
to obtain the optimum 4 frequencies which separate the corresponding pair of classes
best. The results can be found in Table 3, which also includes the optimum 4 frequencies
for all the classes. Table 3 shows that, for some of the pairs, none of the optimum 4 frequencies
is in common with any of the optimum 4 frequencies for all the classes.
Some other pairs have only one frequency in common with the overall optimum
4 frequencies. This would suggest that the separation between the corresponding pairs of
classes is poor, and worse than that between pairs of classes which have a larger number
of optimum frequencies in common. The main weakness of this criterion
function is probably most obviously exposed when it selects the optimum 4 frequencies for
±20° prior information at 90° azimuth angle. At this angle the area difference
between the large and small airplanes is greatest, giving huge separation between
the group of large planes and the group of small planes, but not taking into account
the separation between airplanes within each group, i.e. the criterion function favours
large separation between some classes at the cost of small separation between other
classes. In Table 4 it can be seen that for 90° azimuth angle the 4 lowest frequencies
were selected, all of which carry mostly size information.
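
The tendency described above can be illustrated with a toy example. The criterion function of the study is not reproduced here; as a stand-in, the sketch scores a feature by the sum of squared pairwise distances between class means and compares that score with the smallest pairwise distance. The class means are invented for illustration.

```python
from itertools import combinations

import numpy as np

def pairwise_sq_dists(means):
    """Squared distances between all pairs of class means (one row per class)."""
    return np.array([np.sum(np.abs(a - b) ** 2)
                     for a, b in combinations(means, 2)])

# Invented class means for 5 classes measured on a single feature.
# size_like: two large and three small targets, tightly clustered within each group.
size_like = np.array([[10.0], [10.1], [0.0], [0.1], [0.2]])
# evenly_spread: all 5 classes equally spaced.
evenly_spread = np.array([[0.0], [2.5], [5.0], [7.5], [10.0]])

d_size = pairwise_sq_dists(size_like)
d_even = pairwise_sq_dists(evenly_spread)

# Stand-in "average separation" score: sum over the 10 class pairs.
score_size, score_even = d_size.sum(), d_even.sum()
# Worst-case separation: the closest pair of classes.
worst_size, worst_even = d_size.min(), d_even.min()

print(f"size-like feature:     sum = {score_size:.2f}, min = {worst_size:.2f}")
print(f"evenly spread feature: sum = {score_even:.2f}, min = {worst_even:.2f}")
```

The size-like feature receives the higher summed score although its closest pair of classes is far less separated, which mirrors a selection that favours the large-versus-small split at the expense of discrimination within each size group.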

In summary, the optimum sets of frequencies characterized by the discriminant
criterion yield, in general, lower classification error than the corresponding sets of
equally spaced frequencies.

Table 1: Discriminant analysis, optimum sets of 4 frequencies, coherent measurements,
HHP.

Table 2: Discriminant analysis, optimum sets of 1, 2, 3, 4, 5 and 8 frequencies, coherent
measurements, HHP.

Table 3: Discriminant analysis, optimum set of 4 frequencies for every pair of airplanes,
coherent measurements, HHP.

Table 4: Discriminant analysis, azimuth dependent optimum sets of 4 frequencies,
±20° partial azimuth information, HHP.

Figure 1: Performance of the optimum set of 4 frequencies and the set of 4 equally
spaced frequencies, no prior information. Axis: Classification Error [%]. Legend:
optimum 4 freq., Pave = 26.02; 4 eq. sp. freq., Pave = 26.30.

Figure 2: Performance of the optimum sets of 3, 4, 5 and "8" frequencies and the
whole measurement set of 51 frequencies, no prior information. Legend:
optimum 3 freq., Pave = 25.84; optimum 4 freq., Pave = 26.02; optimum 5 freq.,
Pave = 26.54; optimum 8 freq., Pave = 26.28; all 51 freq., Pave = 27.62.

Figure 3: Performance of the optimum set of 5 frequencies, the optimum set of 5
frequencies measured 10 times and the whole measurement set of 51 frequencies, no
prior information.

Figure 4: Performance of the optimum set of 5 frequencies measured 1, 2, 3, 4, 5,
6, 7, 8, 9 and 10 times, no prior information. Legend: optimum 5 freq., Pave = 26.54;
2 ×, 3 ×, 4 ×, 5 ×, 6 ×, 7 × and 8 × opt. 5 freq.

Figure 5: Performance of the optimum set of 4 frequencies assuming ±20° prior
information and the optimum set of 4 frequencies assuming no prior information,
target at 0°. Axis: Classification Error [%].